In this part of the study, for initial modeling and analysis, I will look at the total number of reported thefts from January 2001 through March 11, 2023. Note that due to the large size of the original dataset (nearly 8 million rows), the raw data is not included in this repository; it can be accessed here. The GitHub repository, including the code for constructing the variables, is here.
For this partition of the data, there are two variables: the year/month and the number of thefts reported in that month. The full dataset has more variables, which are described below; each row in the full dataset represents an individual reported crime.
| Variable | Description |
|---|---|
| ID | Unique identifier for the record. |
| Case Number | The Chicago-assigned case number for the incident. |
| Date | Date when the incident occurred; this is sometimes a best estimate. |
| Block | The partially redacted address where the incident occurred. |
| IUCR | The Illinois Uniform Crime Reporting code. |
| Primary Type | The primary description of the IUCR code. |
| Description | The secondary description of the IUCR code. |
| Location Description | Description of the location where the incident occurred. |
| Arrest | Whether or not the incident resulted in an arrest. |
| Domestic | Whether or not the incident was a domestic incident. |
| Beat | The police beat where the incident occurred. |
| District | The police district where the incident occurred. |
| Ward | The city council district (ward) where the incident occurred. |
| Community Area | The community area where the incident occurred. |
| FBI Code | The FBI crime classification code. |
| X Coordinate | The X coordinate of the location where the incident occurred. |
| Y Coordinate | The Y coordinate of the location where the incident occurred. |
| Year | The year the incident occurred. |
| Updated On | Date and time the record was last updated. |
| Latitude | The latitude where the incident occurred. |
| Longitude | The longitude where the incident occurred. |
| Location | The location of the incident. |
The population data for this project was extracted using the US Census Bureau's American Community Survey (ACS) API, which provided population data for Cook County from 2005-2019. Since Cook County is the base of the Chicago metropolitan area, I chose to extract this data and include a variable that scales the number of thefts by the yearly population. So far, I haven't been able to track down monthly population data for the Chicago area. Note also that the data from 2001-2005 is harder to access, as it is only available through the population estimates API. Therefore, although the original dataset spans 2001-2023, I am focusing on 2005-2019 at this time. The population data from the Chicago website is limited to 2018-2022.
The unemployment data comes from the Illinois government website and focuses specifically on the Chicago metropolitan area. Although there are several variables related to unemployment, for now I will focus only on the unemployment rate for Chicago. Future work may extend this to more variables, such as the labor force participation rate.
thefts <- read_csv("data/thefts_unemployment_population.csv") %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index = Month)
## Rows: 144 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (10): NumThefts, Population, NumTheftsPer10000, Labor Force, Labor Force...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(thefts)
## # A tsibble: 6 x 11 [1M]
## NumThefts Month Population NumThe…¹ Labor…² Labor…³ Emplo…⁴ Emplo…⁵ Unemp…⁶
## <dbl> <mth> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6075 2010 Jan 5394954 11.3 3735500 65.8 3280400 57.8 455100
## 2 4803 2010 Feb 5394954 8.90 3765000 66.3 3317200 58.5 447800
## 3 6085 2010 Mar 5394954 11.3 3772300 66.5 3341200 58.9 431100
## 4 6135 2010 Apr 5394954 11.4 3783300 67.3 3361700 59.8 421600
## 5 6722 2010 May 5394954 12.5 3756900 66.8 3358500 59.7 398500
## 6 6912 2010 Jun 5394954 12.8 3800600 67.5 3383000 60.1 417600
## # … with 2 more variables: `Unemployment Rate` <dbl>, `IL Rate` <dbl>, and
## # abbreviated variable names ¹NumTheftsPer10000, ²`Labor Force`,
## # ³`Labor Force Participation Rate`, ⁴Employed,
## # ⁵`EmploymentParticipation Rate`, ⁶Unemployed
In the original number of thefts, there appears to be a decreasing trend, although not quite a linear one. There doesn’t appear to be a cycle, although it’s possible one would emerge with more data. There is a clear seasonal pattern of a peak in summer followed by a dip in January. Although the data is generally decreasing, there is a particularly sharp drop in the number of thefts since 2020, likely related to the pandemic. The data does not have constant variance: the variance appears to shrink as the series continues.
lambda <- thefts |>
features(NumThefts, features = guerrero) |>
pull(lambda_guerrero)
lambda
## [1] 0.6609234
thefts %>%
autoplot(NumThefts)
### Raw Number of Thefts - Box-Cox Transformation
With a Box-Cox transformation, the data appears to have more stable variance. The seasonal pattern is still prominent, as is the general downward trend, with a steep drop over 2020-2022.
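For reference, the `box_cox()` transformation used here corresponds, for a positive series and nonzero lambda, to the standard formula w = (y^lambda - 1) / lambda. A minimal base-R sketch (the values below are illustrative, not taken from the theft data):

```r
# Box-Cox transformation for a positive series (standard definition)
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transformation, useful for back-transforming forecasts
inv_box_cox_manual <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y <- c(4803, 6075, 6912)          # illustrative monthly counts
w <- box_cox_manual(y, 0.66)      # lambda near the guerrero estimate above
inv_box_cox_manual(w, 0.66)       # round-trips back to the original values
```

The inverse is what you would apply to forecasts made on the transformed scale to report them in original units.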
lambda <- thefts |>
features(NumTheftsPer10000, features = guerrero) |>
pull(lambda_guerrero)
lambda
## [1] 0.6775376
thefts <- thefts %>%
mutate(TheftsPer10000Transformed = box_cox(NumTheftsPer10000, lambda))
thefts <- thefts %>%
mutate(NumTheftsTransformed = box_cox(NumThefts, lambda))
thefts %>%
autoplot(NumTheftsTransformed)
thefts %>%
gg_season(NumThefts)
The data all appear to have a very strong seasonal pattern, although the scale is significantly lower in 2021 and 2022 and higher in 2012 and 2015. Note that in the next plot, we can see that the population of Chicago has been decreasing over time, which may partially explain this. Typically, there is a drop in thefts in the winter and a peak in thefts between late spring and early fall, with the count highest in August.
thefts %>%
gg_subseries(NumThefts)
This pattern is easier to see in the subseries plot, where the number of thefts is clearly highest between June and August and lowest in February. Throughout all of the plots, we can see the number of thefts is decreasing over time, though note that the pandemic, as well as the decline in population, may be partially responsible.

## Autocorrelation of the Number of Thefts {.tabset}

In both the transformed and the original data, the autocorrelation is generally positive, following a decreasing, then increasing, then decreasing trend, with the exception of the pandemic period, where the autocorrelation of the number of thefts appears to decrease and then increase. The autocorrelations in the decreasing stretches are not statistically significant, but almost all of the positive autocorrelations are highly significant. The transformed number of thefts follows the same pattern, but to a lesser extent.

### ACF Plot of the Number of Thefts
thefts %>%
ACF(NumThefts) %>%
autoplot()
thefts %>%
ACF(NumTheftsTransformed) %>%
autoplot()
thefts %>%
autoplot(NumTheftsPer10000)
thefts %>%
autoplot(TheftsPer10000Transformed)
Although the transformed number of thefts follows a similar decreasing pattern, it is less extreme than in the non-transformed data. It is highest around 2010, again with a sharp dip in winter and a peak in summer.
thefts %>%
gg_season(NumTheftsPer10000)
thefts %>%
gg_subseries(NumTheftsPer10000)
# Plotting the Population Over the 10-Year Period

There appears to be a strong quadratic trend in the population data from 2010-2022. There is no cyclicity, and since the data is annual, there is also no seasonality. The variance appears fairly constant, so I don’t think a Box-Cox transformation is necessary.
population <- read_csv("data/all_chicago_pop_estimates.csv") %>%
as_tsibble(index = Year)
## Rows: 10 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Year, Population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
population %>%
autoplot(Population)
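One way to check the quadratic impression is to fit a degree-2 polynomial in Year with `lm()` and look at how much variation it explains. A sketch on synthetic data with a rise-then-fall shape (the numbers here are made up, not estimates from `all_chicago_pop_estimates.csv`):

```r
# Synthetic annual population with a quadratic (rise-then-fall) shape,
# standing in for the real Chicago population estimates
set.seed(1)
year <- 2010:2022
pop  <- 2.70e6 + 8000 * (year - 2010) - 900 * (year - 2010)^2 +
  rnorm(length(year), sd = 2000)

# Degree-2 polynomial trend; poly() uses orthogonal polynomials for stability
fit <- lm(pop ~ poly(year, 2))
summary(fit)$r.squared   # near 1 when the series really is quadratic
```

A high R-squared from the quadratic fit (and a clearly nonzero second-degree coefficient) would support the visual impression from the plot.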
Initially, I will test the four simple models that we’ve covered in class: naive, seasonal naive, mean, and random walk with drift. There appears to be a stronger seasonal component than trend component, so my guess is that the seasonal naive will probably perform the best, but I will test all four models on a training/test split.
Forecasting the number of thefts produces mixed results. Although there is a clear seasonal pattern, with the number of thefts increasing during the summer, the pandemic in the last few years substantially changed the series. However, I will still try to forecast using all four models from class, first with a training/test split, then with cross-validation.

#### Model
thefts_train <- thefts %>%
filter(year(Month) < 2019)
raw_count.model <- thefts_train %>%
model(
naive = NAIVE(NumThefts),
snaive = SNAIVE(NumThefts),
mean = MEAN(NumThefts),
lm = RW(NumThefts ~ drift()),
)
raw_count.model.forecast <- raw_count.model %>%
forecast(h = "3 years")
raw_count.model.forecast %>%
autoplot(filter(thefts, year(Month) >= 2015))
raw_count.model.forecast %>%
accuracy(thefts)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm Test -1608. 1899. 1655. -49.8 50.6 4.43 4.07 0.906
## 2 mean Test -1624. 1924. 1674. -50.4 51.2 4.48 4.12 0.908
## 3 naive Test -1675. 1968. 1719. -51.7 52.5 4.60 4.21 0.908
## 4 snaive Test -1424. 1773. 1467. -43.4 44.4 3.92 3.80 0.876
The root mean squared error is fairly high for all of these. Note that the scale of the data is around 5,000 thefts per month, so the root mean squared error is very high relative to the data. However, the best performing model appears to be the seasonal naive, which captures the seasonal pattern.
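The error measures in the accuracy table are just summaries of the forecast errors e = actual - forecast. A base-R sketch of how they are computed (the numbers are toy values, not the real series):

```r
# Toy actuals and forecasts to illustrate the accuracy measures
# reported by the accuracy() table (values are made up)
actual   <- c(6075, 4803, 6085, 6135)
forecast <- c(4500, 4400, 4600, 4700)
e <- actual - forecast

ME   <- mean(e)                       # mean error (bias)
RMSE <- sqrt(mean(e^2))               # root mean squared error
MAE  <- mean(abs(e))                  # mean absolute error
MAPE <- mean(abs(e / actual)) * 100   # mean absolute percentage error
c(ME = ME, RMSE = RMSE, MAE = MAE, MAPE = MAPE)
```

Because squaring penalizes large errors more heavily, RMSE is always at least as large as MAE, which is why a few badly missed pandemic-era months inflate RMSE so much.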
Secondly, I will look at the residuals of the seasonal naive model fit to the transformed number of thefts.
raw_count.model <- thefts_train %>%
model(
snaive = SNAIVE(NumTheftsTransformed)
)
raw_count.model.forecast <- raw_count.model %>%
forecast(h = "3 years")
raw_count.model.forecast %>%
autoplot(thefts)
raw_count.model %>%
gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
There is strong autocorrelation in the ACF plot. The residuals appear to be left skewed, so they are not normally distributed. The residuals do not appear to be white noise.
raw_count.model.forecast %>%
autoplot(thefts)
percapita.model <- thefts_train %>%
model(
naive = NAIVE(NumTheftsPer10000),
snaive = SNAIVE(NumTheftsPer10000),
mean = MEAN(NumTheftsPer10000),
lm = RW(NumTheftsPer10000 ~ drift()),
)
percapita.model.forecast <- percapita.model %>%
forecast(h = "3 years")
percapita.model.forecast %>%
autoplot(thefts)
percapita.model.forecast %>%
accuracy(thefts)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm Test -2.89 3.43 2.98 -47.9 48.7 4.26 3.96 0.903
## 2 mean Test -2.87 3.44 2.98 -47.8 48.8 4.26 3.97 0.906
## 3 naive Test -3.02 3.56 3.10 -49.9 50.6 4.43 4.11 0.906
## 4 snaive Test -2.55 3.21 2.66 -41.5 42.8 3.80 3.70 0.874
We see similar results with the scaled thefts; however, as we would expect with the scaling, the shape of the series is slightly different: the scale is much smaller, and there also appears to be less variance. Again, none of the models fit the data that well, although the seasonal naive appears to do the best, and all of the models have errors that are high relative to the scale of the data.
Since the data for the number of thefts per 10,000 residents of Chicago needed a Box-Cox transformation, I will test whether that improved the model performance.
percapita.model.t <- thefts_train %>%
model(
naive = NAIVE(TheftsPer10000Transformed),
snaive = SNAIVE(TheftsPer10000Transformed),
mean = MEAN(TheftsPer10000Transformed),
lm = RW(TheftsPer10000Transformed ~ drift()),
)
percapita.model.forecast <- percapita.model.t %>%
forecast(h = "3 years")
percapita.model.forecast %>%
autoplot(thefts)
percapita.model.forecast %>%
accuracy(thefts)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm Test -1.47 1.76 1.51 -42.4 43.0 4.53 4.25 0.899
## 2 mean Test -1.44 1.75 1.49 -41.7 42.6 4.48 4.22 0.902
## 3 naive Test -1.53 1.82 1.57 -43.9 44.6 4.70 4.40 0.902
## 4 snaive Test -1.30 1.63 1.35 -36.7 37.9 4.05 3.95 0.878
As I expected, the best performing model appears to be the seasonal naive model across all measures. Now, I will look at the residuals to see whether they resemble white noise. There is still significant autocorrelation across much of the seasonal data, although there appear to be fewer points with significant autocorrelation. The residuals are more clearly left skewed here.
percapita <- thefts %>%
model(
snaive = SNAIVE(NumTheftsPer10000)
)
percapita %>%
gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
raw_count.t.snaive <- thefts %>%
model(
snaive = SNAIVE(TheftsPer10000Transformed)
)
raw_count.t.snaive %>%
gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
There is still significant autocorrelation in these residuals, and the distribution looks left skewed. Therefore, the seasonal naive method is not appropriate for predicting the number of thefts over time.
This variable, taken from the Illinois website, is the non-seasonally-adjusted unemployment rate in Chicago. Generally, there appears to be a decreasing trend in unemployment from January 2015 to December 2022. However, the pandemic again significantly changed the unemployment rate: there is a dip in 2020, followed by a sharp peak, likely from people losing their jobs during the pandemic. Since the data is not seasonally adjusted, there is a clear seasonal pattern of a spike every few months, particularly in January.

### Original Variable
thefts %>%
autoplot(`Unemployment Rate`)
lambda <- thefts |>
features(`Unemployment Rate`, features = guerrero) |>
pull(lambda_guerrero)
thefts <- thefts %>%
mutate(UnemploymentRateTransformed = box_cox(`Unemployment Rate`, lambda))
thefts %>%
autoplot(UnemploymentRateTransformed)
The transformed unemployment rate looks similar, although not identical. The variance is somewhat more stable, but the major issue is still the spike in 2020 of unemployment.
thefts %>%
gg_season(`Unemployment Rate`)
There typically appears to be a small spike in unemployment in January and a peak during June. However, 2020 is again very different from the rest of the data, with a gradual increase from February to March, followed by a sharp peak in April; the rate remains relatively high throughout 2020.
The lag plots show a strong autocorrelation, so this data is likely not white noise.
thefts %>%
gg_lag(`Unemployment Rate`, lags = 1:12)
thefts %>%
gg_subseries(`Unemployment Rate`)
From the subseries plot, we can see that typically unemployment was decreasing in all months, with the exception of 2020, where it remains consistently much higher.
For this variable, I have chosen to do cross validation to select the model, then do a smaller training and test set once the general model has been evaluated. Again, I expect worse performance in 2020 than in other years.
fc <- thefts %>%
stretch_tsibble(.init = 3, .step = 1) %>%
filter(.id < max(.id)) %>%
model(
snaive = SNAIVE(`Unemployment Rate`),
lm = RW(`Unemployment Rate` ~ drift()),
mean = MEAN(`Unemployment Rate`),
naive = NAIVE(`Unemployment Rate`),
) %>% forecast(h = 3) |>
group_by(.id) |>
mutate(h = row_number()) |>
ungroup() |>
as_fable(response = "Unemployment Rate", distribution = `Unemployment Rate`)
fc %>%
accuracy(thefts)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 2 observations are missing between 2022 Jan and 2022 Feb
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm Test 0.0846 1.75 0.816 -0.253 10.6 0.476 0.579 0.664
## 2 mean Test -1.74 2.87 2.35 -38.2 42.8 1.37 0.949 0.848
## 3 naive Test -0.0988 1.73 0.820 -3.01 10.9 0.478 0.572 0.663
## 4 snaive Test -0.398 3.04 1.73 -11.8 24.0 1.01 1.00 0.827
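The `stretch_tsibble(.init = 3, .step = 1)` call above builds the expanding training windows behind this cross-validation. The same idea in base R, scoring a 1-step naive forecast on a toy autocorrelated series (illustrative only, not the unemployment data):

```r
# Rolling-origin (expanding-window) cross-validation for a 1-step
# naive forecast, mirroring stretch_tsibble() + forecast() for h = 1
set.seed(7)
y <- 5 + arima.sim(list(ar = 0.8), n = 60)   # toy persistent series

init <- 3
errors <- sapply(init:(length(y) - 1), function(t) {
  train <- y[1:t]
  yhat  <- tail(train, 1)   # naive forecast: last observed value
  y[t + 1] - yhat           # 1-step-ahead forecast error
})
sqrt(mean(errors^2))        # cross-validated RMSE for the naive model
```

Each model would be scored the same way over the same set of origins, so the cross-validated errors are directly comparable across models.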
From cross-validation, the best performing model varies based on the metric. However, overall, the naive model and the random walk with drift appear to perform the best. There is a linear trend across most of the data, and while the seasonal pattern exists, it doesn’t appear to be that significant.
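The random walk with drift extrapolates the straight line from the first observation to the last: the h-step forecast is y_T + h * (y_T - y_1) / (T - 1). A base-R sketch (the declining toy series below is illustrative, not the actual unemployment rate):

```r
# h-step forecasts from a random walk with drift: continue the line
# joining the first and last observations
rw_drift_forecast <- function(y, h) {
  n <- length(y)
  drift <- (y[n] - y[1]) / (n - 1)   # average one-step change
  y[n] + seq_len(h) * drift
}

# For a perfectly linear series, the forecast continues the line exactly
y <- seq(10, 5.5, by = -0.5)   # steadily declining toy rate
rw_drift_forecast(y, 3)        # 5.0, 4.5, 4.0
```

This is why the drift model does well here: it captures the roughly linear downward trend in unemployment, which the mean and seasonal naive models miss.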
thefts |>
model(RW(`Unemployment Rate` ~ drift())) |>
gg_tsresiduals()
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
With these residuals, the ACF plot overall doesn’t show any significant autocorrelation. However, the residuals do appear skewed and not normally distributed, and again there is that sharp peak in 2020.